Guide on optimizing Parquet file reading to reduce memory usage #594
haianhng31 wants to merge 3 commits into G-Research:master from
Conversation
adamreeve
left a comment
Nice start, thanks @haianhng31!
Can you also please add a link to the new guide as well as the visitor pattern one to the list at the bottom of index.md?
> APIs for reading Parquet files:
> 1. **LogicalColumnReader API** - Column-oriented reading with type-safe access
> 2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
The Arrow format is still column-oriented
```diff
- 2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
+ 2. **Arrow API (FileReader)** - Reading using Apache Arrow's in-memory format
```
> Each API offers different memory management options that impact memory usage.
> ## Memory Configuration Parameters
This should include a section on the buffered stream parameter (ReaderProperties.EnableBufferedStream). Maybe this could be combined with the Buffer Size section as the buffer size is only used when the buffered stream is enabled? It would be helpful to also link to the documentation for the relevant methods for setting each parameter.
> ### 1. Buffer Size
> Controls the size of I/O buffers used when reading from disk or streams.
> **Default**: 8 MB (8,388,608 bytes) when using default file reading
I think the default is actually 16384, where did you get 8 MB from?
> **Impact**: Larger buffers reduce I/O operations but increase memory usage. Smaller buffers are more memory-efficient but may decrease throughput.
> ### 2. Chunked Reading
> Instead of loading entire columns into memory, read data in smaller chunks.
This could do with some clarification. Is this referring to using the LogicalColumnReader API and controlling buffer/chunk sizes yourself?
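If the intent is manual chunking via the LogicalColumnReader API, a sketch could look like this (file path, column type, and chunk size are all illustrative, not from the guide):

```csharp
using ParquetSharp;

// Illustrative sketch: read a float column in fixed-size chunks rather than
// loading it whole, so peak memory is bounded by the chunk buffer.
using var reader = new ParquetFileReader("data.parquet");  // path is made up
using var rowGroupReader = reader.RowGroup(0);
using var columnReader = rowGroupReader.Column(0);
using var logicalReader = columnReader.LogicalReader<float>();

var chunk = new float[64 * 1024];  // 64K values per read; arbitrary choice
while (logicalReader.HasNext)
{
    int valuesRead = logicalReader.ReadBatch(chunk);
    // process chunk[0..valuesRead] here
}
```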
> **Impact**: Pre-buffering can significantly increase memory usage as it loads data from future row groups before they're needed. This is the primary cause of memory usage scaling with file size reported in Apache Arrow [issue #46935](https://github.com/apache/arrow/issues/46935).
> ### 4. Cache (Arrow API Only)
> The Arrow API uses an internal `ReadRangeCache` that stores buffers for column chunks.
I think this can be merged into number 3, as the cache options only apply when using pre-buffering and are used to configure the pre-buffering behaviour.
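For the combined pre-buffering/cache section, a hedged sketch of turning pre-buffering off in the Arrow API might help (property and constructor parameter names here are written from memory and should be checked against the ParquetSharp documentation):

```csharp
using ParquetSharp;
using ParquetSharp.Arrow;

// Sketch: disable pre-buffering so column chunk data for future row groups
// is not fetched and cached ahead of time. Verify names against the docs.
var arrowProps = ArrowReaderProperties.GetDefault();
arrowProps.PreBuffer = false;

using var fileReader = new FileReader(
    "data.parquet",  // path is illustrative
    arrowProperties: arrowProps);
using var batchReader = fileReader.GetRecordBatchReader();
```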
```csharp
for (int col = 0; col < metadata.NumColumns; col++)
{
    using var columnReader = rowGroupReader.Column(col);
    using var logicalReader = columnReader.LogicalReader<float>();
```
I think it's worth pointing out that ParquetSharp has its own buffering in the LogicalReader API, and this can be configured with the bufferLength parameter of this method.
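For example, the snippet quoted above could pass an explicit buffer length (the file path and value below are illustrative only):

```csharp
using ParquetSharp;

// Illustrative: the bufferLength argument controls ParquetSharp's own
// internal buffering in the LogicalReader API; 4096 here is an example value.
using var reader = new ParquetFileReader("data.parquet");  // path is made up
using var rowGroupReader = reader.RowGroup(0);
using var columnReader = rowGroupReader.Column(0);
using var logicalReader = columnReader.LogicalReader<float>(bufferLength: 4096);
```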
| { | ||
| // Use a buffered stream with custom buffer size (1 MB in this example) | ||
| using var fileStream = File.OpenRead(filePath); | ||
| using var bufferedStream = new BufferedStream(fileStream, bufferSize); |
By using a buffered stream, I actually meant enabling it in the ReaderProperties with ReaderProperties.EnableBufferedStream.
In previous investigations I've found that this can significantly reduce memory usage.
I don't think using a .NET System.IO.BufferedStream will change memory usage characteristics much.
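A sketch of what this suggestion could look like, replacing the `BufferedStream` wrapper with the reader's native buffering (the path is illustrative, and the exact property/constructor shapes should be confirmed against the ParquetSharp docs):

```csharp
using ParquetSharp;

// Sketch: enable Parquet's native buffered stream via ReaderProperties
// instead of wrapping the file in a .NET System.IO.BufferedStream.
var props = ReaderProperties.GetDefaultReaderProperties();
props.EnableBufferedStream();
props.BufferSize = 1024 * 1024;  // 1 MB, matching the example above

using var reader = new ParquetFileReader("data.parquet", props);  // path is made up
```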
> - **Columns**: 10 float columns
> - **Rows**: 100 million (1 million per row group)
> - **Compression**: Snappy
> - **Test System**: MacBook (*Note: real-world performance may vary depending on your operating system and environment*)
Closing this as superseded by #611
No description provided.